Creating Vocabulary Item Types That Measure Students' 
Depth of Semantic Knowledge 

Paul Deane, Rene R. Lawless, Chen Li, John Sabatini, Isaac I. Bejar, & Tenaha O'Reilly 

Educational Testing Service, Princeton, NJ 


We expect that word knowledge accumulates gradually. This article draws on earlier approaches to assessing depth, but focuses on one 
dimension: richness of semantic knowledge. We present results from a study in which three distinct item types were developed at three 
levels of depth: knowledge of common usage patterns, knowledge of broad topical associations, and knowledge of specific conceptual 
relationships. We attempted to avoid common sources of variance across items (such as attractive distractors) and hypothesized that 
the item types that required greater depth of semantic knowledge would tend to show greater difficulty and discrimination after other 
sources of variance were accounted for. Our results, while still exploratory, support the conclusion that the item types measure different 
aspects of lexical knowledge, consistent with the hypothesis of increasing semantic depth. 

Keywords Word knowledge; common usage patterns; topical associations; conceptual relationships 
Overview 

The Need for More Sophisticated Measures of Vocabulary 

Vocabulary is well recognized as an essential component of reading proficiency (Beck & McKeown, 1991; Carroll, 1993; 
Cunningham & Stanovich, 1997; Daneman, 1988; Hirsch, 2003; Perfetti, 1994) with correlations between vocabulary and 
reading comprehension assessments ranging from .6 to .7 (Anderson & Freebody, 1981). While the importance of vocabulary 
development is apparent to researchers and practitioners, the state of the art in vocabulary assessment tends to have 
a strong summative or clinical focus: Most reading vocabulary tests consist of a small sampling of words that vary in 
familiarity and a task that requires choice of a synonym or definition (e.g., Sheehan, Kostin, & Persky, 2006), and most 
clinical receptive/productive tests require the examinee to respond to a picture prompt with a verbal label (e.g., Peabody 
Picture Vocabulary [Dunn & Dunn, 1997]), or vice versa (e.g., Boston Naming Test [Kaplan, Goodglass, & Weintraub, 
1978]). Both classes of vocabulary tests are broad measures that typically can be administered once or twice a year to 
estimate overall vocabulary growth. They tend not to be designed for many classroom-based, formative purposes. The 
cognitive and psycholinguistic literature, however, supports a richer set of vocabulary construct distinctions (e.g., estimating 
aspects of breadth, depth, and word learning skills [i.e., Chabot, Petros, & McCord, 1983; Daneman & Green, 
1986; Dixon, LeFevre, & Twilley, 1988; Durso & Shore, 1991; Hogaboam & Perfetti, 1975; Hu & Nation, 2000; Salthouse, 
1993; Stanovich, West, & Harrison, 1995; Swanborn & De Glopper, 1999; Walczyk & Raska, 1992]), but these are rarely 
incorporated into existing standard instruments in a systematic way. For the purposes of the present study, we focus our 
attention on the depth of knowledge about specific word meanings as one of the most important (and clearly delimitable) 
aspects of vocabulary knowledge. Our goal is to develop a more sophisticated understanding of how to conceptualize and 
assess this particular aspect of word knowledge. 

Word Learning and Depth of Vocabulary Knowledge 

As word learning proceeds, meanings of words grow richer over time. Perfetti and Hart (2001) described word knowledge 
as a complex assemblage of representations that vary both in the information they contain and in the degree to which they 
have been fully specified (i.e., in terms of orthographic, phonemic, syntactic, and semantic quality), which Perfetti and 
Hart referred to as the lexical quality hypothesis. Consistent with the lexical quality hypothesis, we expect that the normal 
course of development is one in which the meaning of a word is initially totally unknown and then gradually becomes 
more fully specified with continued experience. 

Corresponding author: P. Deane, E-mail: pdeane@ets.org 

ETS Research Report No. RR-14-02. © 2014 Educational Testing Service 

Aligned with the topic of word learning are processes of partial word learning, driven in part by the effects of exposure 
to a variety of written texts. A number of theorists have outlined stages of word meaning that postulate differing degrees 
of depth of word knowledge acquired piecemeal. For example, Dale and O’Rourke (1986) postulated four stages of word 
learning: Stage I, the word is completely unknown; Stage II, implicit word knowledge; Stage III, partial knowledge but mastery 
in some contexts; and Stage IV, full mastery across a range of uses. Stahl (1986) outlined a similar (three-stage) theory, 
which is applied in Brown, Frishkoff, and Eskenazi (2005) to the task of automatically generating questions designed to 
probe different aspects of vocabulary depth. Brown et al. (2005) primarily used WordNet semantic relationships to generate 
definition, synonym, antonym, hypernym, hyponym, and cloze questions. In their discussion, they characterized these 
tasks as primarily providing evidence for the middle level of Stahl’s (1986) hierarchy. 

Designing Items to Measure Depth of Vocabulary Knowledge 

If we focus on one aspect of the lexical quality hypothesis, we can identify a number of (somewhat separable) aspects of 
the representation of word meaning that correspond reasonably well to the kinds of inferences theorists have proposed for 
incremental word learning. The following list suggests the kinds of progressions we might see, if depth of semantic knowledge 
corresponds to the development of an increasingly rich and interconnected representation, driven by associative and 
inferential processes (cf. Beck, McKeown, & McCaslin, 1983; Carroll & White, 1973; Fukkink, Henk, & De Glopper, 2001; 
Graves, 1986; Nagy, Anderson, & Herman, 1987; Nagy & Scott, 2000; Schatz & Baldwin, 1986; Schwanenflugel, Stahl, & 
McFalls, 1997). That is, if we start with the assumption that the normal path to semantic knowledge starts with exposure 
to usage, involves inferential processes, and results in the gradual consolidation of a semantic/conceptual representation 
integrated with background knowledge, then we might be able to measure this process at the following levels: 

1. Familiarization with patterns of usage. Experience with words, whether orally or in print, necessarily corresponds 
to some degree of familiarization with the contexts in which the word appears, and thus to characteristic patterns 
of usage. Psychologically this corresponds to the development of perceptual traces, whether or not the student has 
developed more truly semantic associations (and thus attention to some of the kinds of associations that form the 
focus of such models as Bates & MacWhinney, 1989). 

2. Development of appropriate semantic memory representations. A somewhat richer form of semantic knowledge 
comes into play when we consider priming effects and other aspects of fast lexical access involving semantic memory. 
These kinds of processes embody some degree of generalization about words, in which similar words tend to be 
accessed together, without necessarily entailing full conceptual understanding (McKoon & Ratcliffe, 1998; Myers & 
O’Brien, 1998). 

3. Development of appropriate conceptual representations. An even richer form of semantic knowledge involves what we 
might term definitional knowledge (i.e., being able to map from purely verbal to conceptual representations). Such 
knowledge is critical for various forms of reasoning, such as identifying broader or narrower categories (hypernyms 
and hyponyms) and drawing logical inferences (as in Fellbaum, 2010). 

4. Consolidation of conceptual representations with world knowledge. Finally, a truly deep and complete understanding 
of what words mean may require reasoning that integrates purely semantic knowledge with encyclopedic understandings 
of the subject matter being addressed, involving the kind of processes that reach beyond purely linguistic 
representations discussed in Elman (2009). 

Ideally, we would want a vocabulary assessment to be able to support the kinds of qualitative judgments about the richness 
of vocabulary knowledge that people are able to make. 

There is no guarantee that all people will develop semantic knowledge in the sequential order noted in 
the list above. That order is most plausible for vocabulary acquired implicitly, either as part of daily experience or through 
reading. Any vocabulary that is acquired explicitly (perhaps by memorizing dictionary definitions) might follow a different 
sequence. But it seems reasonable to assume, at least as a first approximation and all other things being equal, that 
people with implicit vocabulary knowledge would have less trouble making judgments about Item 1 in the list 
above than about Item 2, about Item 2 than about Item 3, and about Item 3 than about Item 4. This is a testable hypothesis, 
although one that involves serious methodological issues, since the discrimination and difficulty of a vocabulary item can 
be driven by a wide range of factors (e.g., distractors employed). 

Essentially, we are proposing the development of vocabulary items designed specifically to tap different aspects of depth 
of vocabulary knowledge. The approach we are following is analogous to the kind of trajectory described in Embretson 
(1998) and Embretson and Gorin (2001), in which the design is driven by cognitive psychological principles and in 
which features of the item design are chosen precisely to minimize the effects of construct-irrelevant factors. Prior studies 
have shown that various features tend in general to affect the difficulty of vocabulary items, including word frequency, 
abstractness and imageability of word meanings, and age of acquisition (Bird, Howard, & Franklin, 2003; Breland & 
Jenkins, 1997; Breland, Jones, & Jenkins, 1994; Carroll, 1970, 1971, 1976, 1993; Carroll & White, 1973; Coltheart, Laxon, 
& Keating, 1988; Gernsbacher, 1984; McFalls, Schwanenflugel, & Stahl, 1996; Paivio, 1971; Zevin & Seidenberg, 2002). A 
variety of factors beyond vocabulary have been implicated specifically in the difficulty and discrimination of vocabulary 
and reading items (Freedle & Kostin, 1992; Gao & Rogers, 2007; Gorin, 2005; Sheehan & Ginther, 2001; Sheehan, Ginther, 
& Schedl, 1999; Sheehan & Mislevy, 1990, 2001; Sheehan et al., 2006). This fact suggests the possibility of designing 
items, and controlling their discrimination and difficulty, by careful consideration of the aspects of knowledge about 
which information is to be obtained and by equally careful manipulation of variables that will have appropriate effects 
on the reasoning processes of people completing the resulting items. These kinds of considerations can have a significant 
impact on the validity of items (Rupp, Ferne, & Choi, 2006). But if successful, an item design that is carefully controlled 
may be able to yield valid evidence about different aspects of the construct (in this case, vocabulary knowledge), 
and therefore enable the development of vocabulary measures that are much more sensitive to the presence of partial 
vocabulary knowledge. Scott et al. (2008) developed a similar research program, though they used control of distractor 
choices, rather than development of distinct item types, to measure different aspects of depth of vocabulary knowledge. 

Thus, the current study can be placed in a larger context in which a cognitively motivated assessment design is proposed 
and then validated by examining whether items built to that design can be modeled using appropriate and cognitively 
motivated features. In this particular case, we expect that if we develop items specifically intended to measure a particular 
level of depth of vocabulary knowledge, and control for other factors, we will observe patterns of difficulty and 
discrimination that correspond to the level of depth of vocabulary knowledge that each item type requires. 

The study reported here is intended primarily as a proof of concept, in which we went through each step of a cognitively 
motivated design and validation process: creating items to measure different levels of depth of vocabulary, observing item 
parameters in a field study, and then modeling those parameters in a cognitive model that uses observable features of the 
item (including word frequency, but going well beyond this to include other empirical information about the words used 
in each item) to predict item parameters. 


Instrument Development 

Three types of items were developed and tested in this study, each designed to test a different type, 
and ideally a different depth, of vocabulary knowledge. These types included an idiomatic associates item type, intended to 
measure familiarity with patterns of usage; a topical associates item type, intended to measure the kinds of associations 
represented in semantic memory; and the hypernym item type, designed to measure access to conceptual representations 
and associated patterns of inference.¹ All three item types take the form of three-option multiple-choice questions 
containing one correct answer (key). The design of each item type was intended to avoid unnecessary sources of difficulty 
(such as attractive distractors) as much as possible, so that one could reasonably argue that success in answering the 
question demonstrated a control of the relevant kind of lexical knowledge about the targeted word. The format used and 
the specific words selected as part of each design were limited to meet specific design targets such as the overall frequency 
and probability of co-occurrence with the targeted word in a large corpus of edited English texts. 

The Idiomatic Associates Item Type 

The idiomatic associates item type is designed to test students’ knowledge of the typical phrasal patterns characteristic of 
the targeted words. Because it is intended to test this level of knowledge, and no more, the design was motivated by the 
need to ensure that someone could answer the item correctly based only on implicit knowledge of common usage (e.g., 
co-occurrence). Figure 1 illustrates this item type. 
a. purpose 
b. task 
c. question 

Figure 1 A sample idiomatic associates item (word being tested: undertake). 

launch, conduct, complete 

a. relieve 
b. reject 
c. undertake 

Figure 2 A sample topical associates item (word being tested: undertake). 

The development of this item type was governed by the following design decisions: 

• The prompt takes the form of a cloze sentence-completion, multiple-choice item. The word being tested is not one of 
the options. Instead it appears in the stem, just before or just after the blank, in order to make it possible to contrast 
judgments about the three different contexts expressed in the options. As much as possible, nothing in the stem 
other than the targeted word cues the correct answer. 

• The key is a natural, idiomatic, and relatively frequent collocate of the targeted word in the context of the sentence 
presented in the prompt. 

• There should be as little difference between the key and the other options as possible, except their plausibility in 
the context supplied. Thus, the key and the options should belong to the same part of speech and be approximately 
equal in frequency. 

• More specifically, one of the distractors should be so unusual a usage that it is ungrammatical, distinctly odd-sounding, 
or awkward. The other distractor should be plausible in context (if only meaning is taken into account) 
but should only occur rarely in that context. 

We were able to enforce these design constraints by drawing upon corpus data about word frequency, co-occurrence, 
and patterns of co-occurrence, using the Lexile/SourceFinder corpus, a large (462 million word) corpus of edited English 
texts and word frequency information from the Touchstone Applied Science Associates (TASA) corpus (Zeno, Ivens, 
Millard, & Duvvuri, 1995). As long as these constraints are followed, our expectation is that anyone who has heard the 
targeted word frequently enough could recognize the key by recall from perceptual memory alone, without needing to 
access semantic information. 

The Topical Associates Item Type 

The topical associates item type is designed to test students’ knowledge of the kinds of associations that reflect fast lexical 
access processes and thus support semantic priming. As a result, the major design constraint is the need to provide just 
enough information to strongly and unambiguously activate a single topic or concept. Once activated, the association 
between that concept and the targeted word should be obvious to anyone who has an appropriate representation of the 
targeted word in semantic memory. Figure 2 illustrates this item type. 

In order to make sure that the intended association is clear, three stimulus words are presented. The three words are 
supposed to be strongly associated with one another and with the targeted word—but not with the other two options. 
Thus, if the intended associations are available in semantic memory, the student will be able to identify the correct answer, 
even if he or she does not have any deeper conceptual understanding of the target. 

The development of this item type was governed by the following design decisions: 

• The three stimulus words must belong to the same part of speech. 

• The three options must belong to the same part of speech (but may be a different part of speech than the stimulus 
words). 

• There must be a relatively strong association between the stimulus words and the targeted word as measured by a 
mutual information statistic. 

• The mutual information between the stimulus words and the other options must be much lower. 

• The stimulus words must not be synonyms or hypernyms of the targeted word. 

• The stimulus words must not be strong collocates of the targeted word. 

• The stimulus words should not be less frequent than the targeted word. 

• The distractors should be at least as frequent as the targeted word. 

To undertake something is to ____ it. 

a. begin 
b. continue 
c. notice 

Figure 3 Sample hypernym item (Word tested: undertake). 

Once again we were able to enforce these design constraints by drawing upon the TASA word frequencies and 
Lexile/SourceFinder corpus data. 

Most of these constraints were intended to maximize the likelihood that the three stimulus words would prime the 
key but would not prime either of the other two options. A few of them were designed to rule out alternative paths to 
a correct answer. Our expectation is that as long as these constraints are upheld, anyone who can answer the questions 
correctly will have demonstrated that they have appropriate representations of the stimulus and targeted words in semantic 
memory. 
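
The mutual-information constraints above can be illustrated with a small sketch. The report does not say which mutual information statistic was used, so the pointwise mutual information (PMI) formula, the function names, the margin of 2.0, and the toy counts below are all illustrative assumptions rather than the authors' procedure. The stimulus words and options are those of the sample item in Figure 2.

```python
import math

def pmi(pair_count, count_x, count_y, total):
    """Pointwise mutual information (base 2) from raw counts.

    Assumed statistic: the report only mentions "a mutual information statistic".
    """
    if pair_count == 0:
        return float("-inf")
    return math.log2(pair_count * total / (count_x * count_y))

def mean_pmi(stimuli, option, unigram, cooc, total):
    """Average PMI between each stimulus word and one answer option."""
    scores = [
        pmi(cooc.get(frozenset((s, option)), 0), unigram[s], unigram[option], total)
        for s in stimuli
    ]
    return sum(scores) / len(scores)

def key_stands_out(stimuli, key, distractors, unigram, cooc, total, margin=2.0):
    """True when the key's mean PMI beats every distractor's by `margin` bits.

    The margin is a hypothetical threshold, not a value from the report.
    """
    key_score = mean_pmi(stimuli, key, unigram, cooc, total)
    return all(
        key_score - mean_pmi(stimuli, d, unigram, cooc, total) >= margin
        for d in distractors
    )

# Hypothetical corpus counts, invented for illustration only.
TOTAL = 10_000_000
UNIGRAM = {"launch": 4000, "conduct": 5000, "complete": 9000,
           "undertake": 2000, "relieve": 2500, "reject": 3000}
COOC = {
    frozenset(("launch", "undertake")): 40,
    frozenset(("conduct", "undertake")): 60,
    frozenset(("complete", "undertake")): 45,
    frozenset(("launch", "relieve")): 1,
    frozenset(("conduct", "relieve")): 1,
    frozenset(("complete", "relieve")): 2,
    frozenset(("launch", "reject")): 1,
    frozenset(("conduct", "reject")): 2,
    frozenset(("complete", "reject")): 3,
}

STIMULI = ["launch", "conduct", "complete"]
ITEM_OK = key_stands_out(STIMULI, "undertake", ["relieve", "reject"],
                         UNIGRAM, COOC, TOTAL)
```

With real corpus counts, an item for which `key_stands_out` returned False would be a candidate for new stimulus words or new distractors.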

The Hypernym Item Type 

The hypernym item type is designed to test whether students have sufficient access to conceptual representations associated 
with a word to be able to make basic definitional inferences, primarily (but not exclusively) to what are sometimes 
referred to as hypernym relations (Fellbaum, 2010). In the case of nouns, this involves the ability to recognize the broad 
meaning or category to which the targeted words belong. In other cases (e.g., verbs), the item type as we defined it may 
involve the most prominent causal inferences or (in the case of adverbs) the broad category to which the nominal form 
of the word belongs. Figure 3 illustrates this item type. 

The development of this item type was governed by the following design decisions: 

• Like the idiomatic associates, instances of this item type take the form of cloze sentence-completion items. 

• The targeted word is contained in the stem, in order to contrast judgments about the three possible hypernym 
relationships expressed in the options. Nothing in the stem other than the targeted word should provide a cue to the 
correct answer. 

• The options should contain words that plausibly could fit in the blank, belong to the same part of speech, are approximately 
the same frequency as the targeted word, and are more or less at the same broad level of abstraction as the 
key. 

• The key makes the sentence a true statement that partially defines the targeted word. 

• The two distractors, when placed in the blank in the stem, should produce sentences that are not true as definitions, 
even if they might be contingently true in some situations. 

We were able to enforce these constraints in part by setting thresholds on a number of natural language processing 
(NLP)-derived features: TASA and Lexile/SourceFinder word frequencies and WordFit similarities (Deane, 2003, 
2005; Deane & Higgins, 2006) that indicate whether words tend to appear characteristically in the same n-gram 
contexts. 

These constraints were intended to eliminate possible routes to a correct answer other than access to a definitional 
understanding of the targeted word. As the item is designed, it is not necessary to have a deep conceptual understanding 
of the word’s meaning; all that is required is sufficient knowledge and understanding to make the correct inferences. Thus 
nothing in this item entails that someone who gets a hypernym item correct will be able to provide exact definitions of 
the word or produce the word only in semantically appropriate contexts. 
Selection of Vocabulary 

The main study required the selection of vocabulary targeted for instruction in the middle grades, so that many students 
would be likely to have partial vocabulary knowledge for these words and relatively few would have achieved the maximal 
level of depth (i.e., the full consolidation of lexical knowledge and its integration with appropriate, associated knowledge 
of the world). We therefore developed items for 50 general academic words selected from the vocabulary taught in middle 
school as part of the Word Generation vocabulary intervention (Snow, Lawrence, & White, 2009). We developed idiomatic 
associates, topical associates, and hypernym items for each word, along with a semantic associates item type intended to test the 
maximal depth of vocabulary knowledge. 

These item types were field-tested in June 2009 by inclusion in the posttest for a study administered by the Word Generation 
research group. This test was administered to 2,825 students in 14 middle schools in an urban New England 
school district. Test forms were assembled, each containing an anchor test of 50 Word Generation multiple-choice synonym 
items followed by one of 20 test forms composed of 10-12 homogeneous groups of the newly developed items that 
were randomly distributed among the students. Mean scores were calculated for each item type and by individual items 
to examine the estimated difficulty for this population. An item analysis was conducted and specific statistics were examined 
for inclusion/exclusion in the current study: the TASA standard frequency index (SFI),² item proportion correct 
(P+), point-biserial correlation (r_pb), and option-choice frequencies. These results enabled problematic items to be identified 
and revised prior to the main study and led to the rejection of the semantic associates item type as too difficult and 
unreliable for inclusion in this study. The use of the Word Generation items for the field test had the advantage that the 
words had been targeted for instruction, increasing the probability that students would have at least a partial knowledge 
of the words, which diminished the risk that a word might turn out almost universally unknown for the student population. 
Since we were using this data collection primarily to identify items in need of revision, the advantages of this venue 
outweighed the potential limitations. 

For the main study, the participating students had not been specifically targeted to learn these words. It was necessary 
to reduce the total number of words from 50 to 20, and we desired to do so in a way that would maintain a balanced set of 
items after word frequency and difficulty were taken into account. Words were therefore first sequenced in ascending order 
using the P+ values on the SERP Word Generation synonym items from the pilot study. They were then clustered into five 
ranges of difficulty: hard (less than .40), high-medium (.40-.49), medium (.50-.59), low-medium (.60-.79), and easy 
(greater than .80). Next, the TASA SFIs³ were looked up for each word. These word frequencies were computed from an 
analysis of more than 60,000 samples of printed text encountered by students in American schools. They therefore provide 
an external measure, not influenced by the variability that can occur in student performances, for checking that the 
vocabulary clusters (based on P+) chosen for this study were approximately equivalent in difficulty. 
To narrow down the list to a total of 20 words to be tested with the new item types, the point-biserial 
correlations were examined across item types for values of r_pb < .15, and option-choice frequencies were examined with 
the actual items to look for double-keys or distractors that may have been overly attractive. Preference was given to words 
for which all item types had acceptable point-biserial correlations and no evidence of double-keying. The final selection 
of words was made to sample across a wide range both of P+ values and word frequencies. 
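
The banding and screening steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the sample statistics are hypothetical, a P+ of exactly .80 is assigned to the easy band (the text leaves that boundary open), and "acceptable" is read as r_pb of at least .15.

```python
R_PB_FLOOR = 0.15  # items with r_pb below this value were examined as problematic

def difficulty_band(p_plus):
    """Map a pilot P+ value to one of the five difficulty ranges named in the text.

    Assumption: a value of exactly .80 falls in the "easy" band.
    """
    if p_plus < 0.40:
        return "hard"
    if p_plus < 0.50:
        return "high-medium"
    if p_plus < 0.60:
        return "medium"
    if p_plus < 0.80:
        return "low-medium"
    return "easy"

def is_preferred(stats_by_item_type):
    """True when every item type has an acceptable r_pb and no double-key flag."""
    return all(
        s["r_pb"] >= R_PB_FLOOR and not s["double_key"]
        for s in stats_by_item_type.values()
    )

# Hypothetical field-test statistics for one word, invented for illustration.
EXAMPLE_WORD = {
    "idiomatic": {"r_pb": 0.31, "double_key": False},
    "topical": {"r_pb": 0.28, "double_key": False},
    "hypernym": {"r_pb": 0.12, "double_key": False},
}

BAND = difficulty_band(0.45)
PREFERRED = is_preferred(EXAMPLE_WORD)
```

In this invented example the word would not be preferred, because its hypernym item falls below the r_pb floor.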


Study Design 

The Experimental Instrument 

The goal of the study was to explore whether the use of the three different types of multiple-choice items could reveal 
distinguishable levels of depth of vocabulary understanding from middle school students. A within-subjects design was 
incorporated in order to measure students’ performances on all three item types for the same set of academic vocabulary 
words. Items were created for two sets of 10 vocabulary words, and each student was exposed to one set of 10 or the 
other; hence, each student was exposed to each word within his or her assigned set three times in the context of the three 
item types. Because each item type required a different type of thinking, the tests were assembled such that students would 
be exposed to one item type at a time in an effort to reduce their cognitive load (i.e., the 10 homogeneous items for each 
type were presented together). To detect any learning effects that may have been caused by multiple exposures of the 10 
vocabulary words, the three sets of item types were assembled in six different sequences, as exemplified in Table 1. 
Table 1 Test Assembly and Test Form Assignment of the Experimental Items in the Within-Subjects Study 

Word set 1     Word set 2 
form number    form number    Test Section 1    Test Section 2    Test Section 3 
001            007            Item Type B       Item Type A       Item Type C 
002            008            Item Type B       Item Type C       Item Type A 
003            009            Item Type A       Item Type B       Item Type C 
004            010            Item Type A       Item Type C       Item Type B 
005            011            Item Type C       Item Type A       Item Type B 
006            012            Item Type C       Item Type B       Item Type A 

Note: Item type key: A = topical associations; B = idiomatic associates; C = hypernym. 
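
The six section orders in Table 1 are the six possible orderings of three item types. A brief sketch makes the counterbalancing property checkable; the section orders are copied from Table 1, while the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations

# Section orders for forms 001-006 (word set 1), copied from Table 1.
# Forms 007-012 use the same orders for word set 2.
TABLE_1_ORDERS = {
    1: ("B", "A", "C"),
    2: ("B", "C", "A"),
    3: ("A", "B", "C"),
    4: ("A", "C", "B"),
    5: ("C", "A", "B"),
    6: ("C", "B", "A"),
}

def is_fully_counterbalanced(orders, item_types=("A", "B", "C")):
    """True when the orders are exactly the distinct permutations of the item types."""
    return sorted(orders) == sorted(permutations(item_types))

def position_counts(orders):
    """Count how often each item type appears in each test section (1-based)."""
    return Counter(
        (section, item_type)
        for order in orders
        for section, item_type in enumerate(order, start=1)
    )

BALANCED = is_fully_counterbalanced(TABLE_1_ORDERS.values())
COUNTS = position_counts(TABLE_1_ORDERS.values())
```

Because all six permutations are present, each item type occupies each test section in exactly two of the six forms for a word set.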



Table 2 Tested Vocabulary and Item Sequencing for the Three Different Item Types 

                                 Position in set 
Vocabulary word    Word set    Item type A    Item type B    Item type C 
Adequate           1           10             1              2 
Circumstances      1           5              9              9 
Concept            1           3              2              3 
Distribution       1           6              8              1 
Eliminated         1           7              6              4 
Explicit           1           1              5              7 
Intrinsic          1           4              10             10 
Invoked            1           2              7              8 
Paralyzed          1           8              4              6 
Undertake          1           9              3              5 
Attained           2           3              5              1 
Capacity           2           10             3              2 
Outweigh           2           8              2              3 
Generate           2           7              6              4 
Compatible         2           2              8              5 
Regime             2           4              10             6 
Regulate           2           1              4              7 
Acquired           2           6              1              8 
Incentives         2           9              9              9 
Enforced           2           5              7              10 


Between each of the test sections, the placement of items (for the same words) was sequenced by random assignment, 
as can be seen in Table 2. This measure discouraged students from easily referring back to past items for information about 
the same word. 

While we counterbalanced the test design to detect any ordering effects across the three depth item types, the most 
important feature of the design is the fact that each student was required to answer all three depth items for each tested 
word. This allowed us to place all three item types on a common scale and directly compare item characteristics across 
item types. 

An important feature of the instrument was the inclusion of an anchor set of 20 items to allow for the computation 
of a covariate for ability. The items in this section of the test were selected from the Word Generation synonym items for 
20 of the words from the original set that were not tested using depth items. These items also were selected to maximize 
range of coverage and reliability of the individual items, using the field test data from the June 2009 study to estimate item 
characteristics. This section of the test was always administered as the final portion of the test. 

Participants and Testing Conditions 

The targeted population comprised students in Grades 7 and 8 from across the United States, drawn from six urban, six suburban, 
and eight rural schools in Alabama, Arkansas, Arizona, California, Connecticut, Georgia, Iowa, Idaho, Illinois, 
Indiana, Kentucky, Nevada, and Tennessee. A total of 1,449 seventh grade and 1,622 eighth grade students participated in 
this study. Parental consent was obtained for every student, and schools were paid $10 per completed test returned. 
Figure 4 Option-choice frequencies for targeted word, generate. (Panels: TA-generate, H-generate.) 


Paper-and-pencil tests were administered to all students with accompanying Scantron answer sheets. Test forms were
spiraled throughout each school, each grade, and each classroom to maintain the random design and ensure that all test
forms were taken by approximately the same number of students. Teachers were given explicit administration instructions
that included sample test questions and the relevant item instructions, which were to be reviewed with their students in
advance. Because the test was not speeded, teachers were also asked to accommodate any students who needed additional
time to finish, and they were reminded that their class would earn the financial incentive only for completed tests. They
were also instructed to encourage students to provide their best guesses on questions that they found challenging or
whose answers they were not absolutely certain of.


Results and Initial Analysis 

Initial Item Analyses 

A routine item analysis was run on each administered item to examine its proportion correct (P+), point-biserial
correlation (r_pb), and option-choice frequencies, for which histograms were generated and mean scores of students
selecting each option were calculated (Figure 4).
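
The statistics named above can be sketched in a few lines of Python. This is a minimal illustration, not the item analysis program used in the study: the function name, the response coding, and the sample data are hypothetical, and the point-biserial is computed against the uncorrected total score (an assumption, since the text does not say whether the item was removed from the total).

```python
from collections import Counter
from math import sqrt


def item_analysis(responses, key, total_scores):
    """Classical statistics for one multiple-choice item.

    responses    -- option chosen by each student, e.g. "A", "B", "C"
    key          -- the correct option
    total_scores -- each student's total test score, in the same order
    """
    n = len(responses)
    correct = [1 if r == key else 0 for r in responses]

    # P+: proportion of students answering the item correctly.
    p_plus = sum(correct) / n

    # Point-biserial: Pearson correlation between the 0/1 item score
    # and the total test score.
    mean_x = p_plus
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(correct, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in correct)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    r_pb = cov / sqrt(var_x * var_y) if var_x and var_y else float("nan")

    # Option-choice frequencies (as percentages) and the mean total score
    # of the students who selected each option.
    options = {}
    for option, count in Counter(responses).items():
        scores = [t for r, t in zip(responses, total_scores) if r == option]
        options[option] = {
            "percent": 100 * count / n,
            "mean_total": sum(scores) / count,
        }
    return p_plus, r_pb, options


# Hypothetical responses from six students (not data from the study).
p_plus, r_pb, options = item_analysis(
    ["A", "A", "B", "A", "C", "A"], "A", [30, 28, 12, 35, 10, 25]
)
print(p_plus, r_pb, options)
```

The per-option mean total score is what makes a histogram such as Figure 4 informative: a distractor chosen mainly by low scorers behaves as intended, whereas one chosen by high scorers deserves a closer look.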

Figure 5 Sample plot of total test score by item (P+) for all students.

Figure 6 Sample plot of total test score by item (P+) for seventh grade students.


A close examination of the option-choice frequencies was performed to determine whether instances of attractive
distractors or double-keys occurred, and items showing either were coded as such. As a reminder, these item types were
designed to assess whether or not students possessed a particular word sense within each item type. It was therefore
imperative that the distractors not be subtle; students either knew the word at a topical (i.e., superficial) level or
they did not. This is why the distractors were designed to be obviously incorrect: to separate those students who had a
good sense of the targeted word in the presented context from those who did not. If 25% or more of the students selected
a particular distractor, it was judged to be an attractive distractor. If the key was chosen less frequently than any
given distractor, then the item was judged to be a double-key.
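
The two coding rules can be expressed as a short check on an item's option-choice percentages. The sketch below is illustrative only: the function name and the example percentages are hypothetical, and the double-key rule is read here as "at least one distractor was chosen more often than the key," which is one interpretation of the wording above.

```python
def flag_item(option_percents, key, threshold=25.0):
    """Return (attractive_distractors, is_double_key) for one item.

    option_percents -- percentage of students choosing each option
    key             -- the correct option
    threshold       -- the 25% cutoff stated in the text
    """
    distractors = {o: p for o, p in option_percents.items() if o != key}

    # Attractive distractor: selected by 25% or more of the students.
    attractive = sorted(o for o, p in distractors.items() if p >= threshold)

    # Double-key (as interpreted here): the key was chosen less often
    # than at least one distractor.
    is_double_key = any(p > option_percents[key] for p in distractors.values())
    return attractive, is_double_key


# Hypothetical percentages, not taken from the study's items.
print(flag_item({"A": 71.0, "B": 17.0, "C": 12.0}, key="A"))
print(flag_item({"A": 40.0, "B": 45.0, "C": 15.0}, key="A"))
```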

For all students and for each grade, plots were created for each word that superimposed line graphs of item P+ values
against total test score for the three item types. As shown in Figures 5-7, these plots were used to examine the patterns
created by the correct responses. The plots provide a visualization for roughly comparing item difficulty by item type and
for identifying the middle range of scores, where students with partial knowledge could be singled out for later,
closer examination.

We hypothesized that for students possessing partial knowledge, the different item types may show different levels of
depth of understanding. For example, the plots for the word generate all demonstrate a similar pattern:
Hypernyms appear to be the hardest item type, followed by topical associates, with idiomatic associates appearing to
be easiest. However, closer examination shows that although these three curves approximately track each other, there
is a total score where the low-ability students can be separated out from the middle-ability students, and similarly, the
middle-ability students from the high-ability students. It is these middle-ability students (the malleable middle) that
most interest us, as we posit that the high-ability students will understand the targeted words regardless of the context
in all three item types and the low-ability students may not understand or be familiar with the targeted words at all. So,
while looking at the overall performance pattern for this word is helpful, it is the examination of Figures 6 and 7 that may be

Figure 7 Sample plot of total test score by item (P+) for eighth grade students.


Table 3 Sample Table of Item Response Theory (IRT) Parameters for the Targeted Word, Generate

                                 Grade 7                         Grade 8
Item type               a         b         c           a         b         c
Topical associate       1.0564    -1.3864   0.3447      0.9949    -1.2915   0.2668
Hypernym                0.8097    0.1761    0.3872      0.5103    -0.0688   0.2992
Idiomatic associate     0.8999    -0.3485   0.2969      0.8359    -0.6683   0.2699

Note. a = parameter for the discrimination, b = parameter for the item difficulty, c = the guessing parameter.


revealing. In this case, an examination of Figure 6 reveals that the seventh grade students with partial knowledge of the
word generate seem to have total test scores between 14 and 38; and in Figure 7, the curves begin to flatten out sooner, as
80% of the eighth grade students appear to understand the idiomatic uses of the word and its topical associations at
lower total score points. This finding is important for the subsequent analyses that will be explained later in this section.
One might also infer from these figures that in the case of the word generate the hypernym item may have elicited the
deepest knowledge; the topical associate, midrange knowledge; and the idiomatic associate, the most superficial. For
this reason, a closer examination of the performance of all of the items was needed so that hypotheses could
be drawn and tested.


Item Response Theory Analyses 

Because the sample size was adequately large and because we had an anchor test administered to all students, we
were able to equate the tests across the two different sets of words (as shown in Table 2) that were administered
and the two different grades and to run a three-parameter item response theory (3PL IRT) analysis. This allowed
for a meaningful comparison of the discrimination (a parameter), the item difficulty (b parameter), and guessing (c
parameter) on standardized scales. The procedure for equating the items was similar to that used for the National
Assessment of Educational Progress (NAEP), which uses the PARSCALE IRT software (Mislevy, Johnson, & Muraki,
1992).

First, the response and scored data were cleaned and sorted by targeted word and item type. Because there were two 
sets of targeted words for two grades, the data were divided into four groups: Grade 7 Set 1, Grade 7 Set 2, Grade 8 Set 1, 
and Grade 8 Set 2. Next, an item analysis program was used to create the input files for PARSCALE (NAEP Version 3.1) 
for each of the four datasets. PARSCALE was used to generate the item characteristic curves (ICCs) for the item responses 
and input files, and the software program TBLT (NAEP Version 2.30) was used to equate all datasets to the Grade 7 Set 1 
parameters. The actual equating process used was as follows: when equating Grade 7 Set 2 to Grade 7 Set 1, the 20 common 
items were used as anchor items; when equating Grade 8 Set 1 to Grade 7 Set 1, all of the 50 items, including common 
items, were used as anchor items because all of the items were the same in the two tests; and when equating Grade 8 Set 
2 to Grade 7 Set 1, the 20 common items were used as anchor items. The resulting parameters were calculated (Table 3) 
and ICCs were generated (Figure 8). 
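
For readers less familiar with the 3PL model, the curve defined by the a, b, and c parameters can be sketched as follows. This is a generic illustration of the model, not the PARSCALE computation: the function name is hypothetical, and the scaling constant D = 1.7 is an assumption, since the text does not state which metric the reported parameters are on. The parameter values are the Grade 7 entries from Table 3.

```python
from math import exp


def icc_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model.

    theta -- examinee ability
    a     -- discrimination (slope at the inflection point)
    b     -- difficulty (location of the inflection point)
    c     -- guessing (lower asymptote)
    D     -- scaling constant; 1.7 is assumed here, not taken from the text
    """
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))


# Grade 7 parameters for "generate", copied from Table 3.
GRADE7_GENERATE = {
    "Topical associate": (1.0564, -1.3864, 0.3447),
    "Hypernym": (0.8097, 0.1761, 0.3872),
    "Idiomatic associate": (0.8999, -0.3485, 0.2969),
}

for item_type, (a, b, c) in GRADE7_GENERATE.items():
    curve = [round(icc_3pl(theta, a, b, c), 2) for theta in (-2, -1, 0, 1, 2)]
    print(item_type, curve)
```

Whatever the value of D, the curve passes through (1 + c) / 2 at theta = b, which is why b is read as the location of the inflection point and a as the steepness of the curve around it.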

Figure 8 Sample three parameter (3PL) item characteristic curves for all item types and both grades.


Table 4 Means and Standard Deviations for Individual Responses to Each Item Type

Item type               Minimum   Maximum   Mean number correct   SD
Idiomatic associate     0         10        6.87                  1.780
Topical associate       0         10        7.44                  2.155
Hypernym                0         10        6.99                  1.976


Validity Evidence 

When this research commenced, hypotheses were formulated about the nature of the item types that were designed. The
main idea was that different types of items could be developed to measure students’ partial knowledge of specific words.
However, we emphasize that the nature of this research is exploratory. In this section, we attempt to explain the
relationships that we observed, but our conclusions come with the caveat that we were only able to test a relatively small
number of words across item types; therefore, we do not generalize beyond what was observed. With subsequent data
collections, we hope to observe similar patterns that will add further evidence supporting our theories.

The hypotheses that motivated the development of the three item types imply certain relationships among them. All else 
being equal, we might expect the idiomatic associates item type to be easier to answer than the topical associates item type; 
that in turn might be expected to be easier to answer than the hypernym item type based upon the kinds of knowledge 
structures each item type is designed to draw upon. But all things are seldom equal, and with a small dataset, the expected 
underlying relationships (if present) may not be easy to tease out. Table 4 shows the surface-level performance of each 
item type, which does not show the order of difficulty among item types one might expect. 

With only 20 cases to analyze, however, the influence of outliers can be large; for instance, two of the 20 idiomatic
associates items have a P+ below that which would be expected by chance, and their inclusion lowers the mean P+ below
the level displayed by the other two item types. Yet there are other ways to evaluate the validity of the item design.

To begin with, we may consider that we are postulating a construct, the depth of semantic knowledge of word
meaning, which is, ex hypothesi, more fully measured the deeper an item probes for the knowledge of a word’s
meaning. It follows that we would expect (all other things being equal) that the hypernym item type would correlate more
strongly with an independent measure of this construct than the topical associates item type, and the topical associates 
item type to correlate more strongly with an independent measure of this construct than the idiomatic associates item 
type. This relationship is easily measured with our data, since we have an anchor set of 20 Word Generation synonym 
items, selected to have reasonably high reliabilities with total score, and to span a broad range of difficulty levels. Such 
synonym items require a high degree of semantic knowledge, since the respondent must differentiate between an exact 
synonym and nonsynonym distractors that may be plausibly similar and related to the targeted word. We correlated the 
total score of each item type with the total Word Generation score and obtained the following correlations (N = 3,075), 
which fall in the expected order, as shown in Table 5. 

Table 5 Correlations Between the Three Item Types and a Separate Measure of the Construct

Item type                             Correlation with Word Generation anchor set
Hypernym total correct                .61
Topical associate total correct       .58
Idiomatic associate total correct     .54

Note. p < .001.


Table 6 Discrimination Means and Standard Deviations for the Three Item Types

Statistic   Idiomatic associate   Topical associate   Hypernym
Mean a      .70                   .89                 .94
SD          .25                   .29                 .35

Note. a = parameter for meaningful comparison of the discrimination.




Another piece of validity evidence can be derived by considering the relationship between the mean 3PL IRT parameters.
As already pointed out above, we would expect items that measure deeper aspects of semantic knowledge to show higher
discrimination and difficulty. The difficulty parameter (b) may be a somewhat less clear measure of depth of vocabulary
knowledge, since it represents only the location of the inflection point in the IRT model, which may be affected by a
variety of factors not strongly associated with depth of vocabulary knowledge (e.g., word frequency), whereas the
discrimination parameter (a) indicates how well an item separates individuals with high levels of vocabulary knowledge
(who are therefore much more likely to have a deeper knowledge of a particular word) from those with much less knowledge
(who are therefore likely to have much shallower knowledge of that word). And in fact, if we calculate the mean
discriminations for each item type, they fall in exactly the expected order, as shown in Table 6. Higher values indicate
that the item type better discriminates between higher- and lower-ability students (who were defined by their
performances on the anchor set of Word Generation synonym items).

A two-way analysis of variance (ANOVA) comparing discrimination differences between item types and across grades
yielded no main effects or interactions involving the grade in which students took the tests. However, there was a main
effect of item type, F(2, 5) = 7.197, p < .01, indicating that there were differences in the mean discriminations. A post
hoc Tukey’s honestly significant difference (HSD) test demonstrated that discriminations for the idiomatic associates item
type were significantly lower than those for the topical associates (p = .02) and the hypernyms (p < .001). Although
the means differ between the topical associates and the hypernyms, the difference is not statistically significant, and
with this minimal number of items, the pattern can only be deemed suggestive.
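
To make the logic of the item-type comparison concrete, the sketch below computes the F ratio for a one-way ANOVA. It is a deliberate simplification and not the analysis reported above, which was a two-way design (item type by grade) followed by Tukey's HSD; the function name is hypothetical and the discrimination values are invented purely for illustration.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups -- list of lists, one list of observations per group
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Invented a parameters for three item types (NOT the study's data).
toy_discriminations = [
    [0.55, 0.70, 0.85],   # idiomatic associate (hypothetical)
    [0.75, 0.90, 1.05],   # topical associate (hypothetical)
    [0.80, 0.95, 1.10],   # hypernym (hypothetical)
]
print(one_way_anova_f(toy_discriminations))
```

A large F indicates that the item-type means differ by more than the spread of items within each type would lead one to expect; a post hoc test such as Tukey's HSD is then needed to say which pairs differ.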

A manual analysis of the design of items was conducted to investigate how well we adhered to the design specifications. 
Topical associates were constrained by the expectation that all three cue words would be associated with the key and that 
none of them would have strong associations with the distractors. After the initial item analyses were completed, the 
stimuli of 10 random topical associate items were manually coded for how well they met this constraint. This coding 
system was tested on the remaining ten items. The average a parameter for items that were judged to meet the constraint 
perfectly was 0.92. As Table 7 shows, items judged not to meet the constraint perfectly were far more likely to fall below 
the mean. 


Table 7 Distribution of Topical Associate Items by Independent Judgment of Design Fit

Item type                                                                      Items where a < .92   Items where a > .92
Topical associate items judged not to fully comply with the intended design   9                     1
Topical associate items judged to closely match the intended design           20                    10

Note. N = 40 as parameters for seventh and eighth grades were calculated separately. a = parameter for meaningful
comparison of the discrimination.

Table 8 Distribution of Idiomatic Associate Items with Attractive Distractors

Item type                                                   Items where a < 0.6   Items where a > 0.6
Idiomatic associate items with attractive distractors       2                     10
Idiomatic associate items without attractive distractors    12                    16

Note. N = 40 as parameters for seventh and eighth grades were calculated separately. a = parameter for meaningful
comparison of the discrimination.


Table 9 Distribution of Hypernym Items with Attractive Distractors

Item type                                          Items where a < 1   Items where a > 1
Hypernym items with attractive distractors         7                   1
Hypernym items without attractive distractors      18                  14

Note. N = 40 as parameters for seventh and eighth grades were calculated separately. a = parameter for meaningful
comparison of the discrimination.


This post hoc manual analysis of the item properties suggests that the differences between the topical associates and 
the hypernyms (shown in Table 6) might have been significantly different if the item design process had been more tightly 
controlled. 

This type of analysis carried over into the other item types. For instance, one of the design constraints for all of the 
item types was to ensure that no attractive distractors were included in the three option choices. However, the initial 
item analysis (described previously) indicated a significant number of idiomatic associates items contained attractive 
distractors. The mean a parameter for the items without attractive distractors was 0.6, and as Table 8 shows, nearly all of 
the items with attractive distractors were above the mean, which suggests that if they had been more tightly constrained, 
the separation between these items could have been even stronger. 

On the other hand, for the hypernym item type, the effects of attractive distractors went in exactly the opposite
direction. Seven of the eight items with attractive distractors were below the mean for hypernym items without attractive
distractors; thus, it is plausible that if such items had been avoided, the hypernym items would have been more strongly
separated from the other two item types (see Table 9).
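
The contrasts described for Tables 7-9 are easier to see as row proportions. The short script below simply restates the published counts as the share of items in each row that fall below the cutoff; the dictionary layout and function name are illustrative choices, and no significance test is implied.

```python
# Counts copied from Tables 7-9: (items below cutoff, items above cutoff).
TABLES = {
    "Table 7, topical associate (cutoff a = .92)": {
        "not fully compliant with design": (9, 1),
        "closely matches design": (20, 10),
    },
    "Table 8, idiomatic associate (cutoff a = 0.6)": {
        "with attractive distractors": (2, 10),
        "without attractive distractors": (12, 16),
    },
    "Table 9, hypernym (cutoff a = 1)": {
        "with attractive distractors": (7, 1),
        "without attractive distractors": (18, 14),
    },
}


def share_below(below, above):
    """Proportion of the items in one table row that fall below the cutoff."""
    return below / (below + above)


for table, rows in TABLES.items():
    for label, (below, above) in rows.items():
        print(f"{table} | {label}: {share_below(below, above):.0%} below")
```

Each table's four cells sum to 40, matching the N reported in the table notes.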

The net implication of these post hoc analyses is that the differentiation of the item types by discrimination
could have been even stronger than it was in this study. They also suggest the hypothesis that strict adherence to the
item design specifications should replicate the observed effects even more clearly.

Further Validity Evidence Supported by Natural Language Processing Features 

Additional analyses were conducted to ascertain whether certain natural language processing (NLP) features could be used
to predict the IRT parameters. NLP features are statistics calculated from corpus data using computational techniques;
they characterize different linguistic properties of text numerically.

Our items were constructed according to strict specifications designed to measure specific types and various degrees 
of semantic knowledge of words. This makes each item type a relatively simple, pure assessment of a specific type of 
knowledge. We determined that an attempt to predict the IRT parameters was far more likely to be successful if the 
vocabulary knowledge followed a gradient of depth and our item types succeeded in their design of measuring different 
levels of depth. 

Since we had 20 items per item type, distributed over two grades, the sample sizes were just large enough to support 
a regression analysis. Given the relatively small number of NLP features we wanted to use in the regression, we avoided 
any model that had a large number of significant predictors. We treated this analysis as purely exploratory, to be tested by 
extension to additional items in subsequent studies. 

We collected a series of NLP features designed to measure aspects of word knowledge, as follows: 

• Word frequencies. These are from the TASA corpus (Zeno et al., 1995) and from the SourceFinder corpus, a
425-million-word collection of journal articles and readings appropriate to K-12.

• Conditional probabilities. These are based upon how frequently words co-occur in the SourceFinder corpus. This
is a direct measure of the probability of one word appearing given another word’s appearance within the same
paragraph.

• Association cosines. These were calculated from a database developed by this project that identified clusters of
topically associated vocabulary based upon the same corpus we used previously for word frequencies, which enabled
us to identify how closely a given word was associated with specific topics. We were able to use this tool to determine
how closely the words used in the topical associates items conformed to the intended item design specification. These
cosines indicate whether the pattern of associations in which a word participates is similar to, or different from, the
general pattern of associations found for the targeted words in our items.

• Semantic vector measurements. We had relatively easy access to two measures: latent semantic analysis (LSA;
Landauer, Foltz, & Laham, 1998), and correlated occurrence analog to lexical semantics (COALS; Rohde, Gonnerman,
& Plaut, 2005). These features measure the general latent tendency of any pair of words to appear in the same
documents; thus a high cosine value between two words indicates that they tend to occur in similar topical, semantic,
or syntactic contexts.

• Word fit cosines (Deane, 2003). These resemble LSA or COALS vectors, in that they are built using the same
mathematical methods (singular value decomposition), but the underlying data are the associations between words and
phrasal contexts. As a result, this measure provides estimates of how plausible a word sounds in a phrase, based
upon corpus data.
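
Two of the building blocks in the list above, paragraph-level conditional probability and the cosine between two vectors, can be sketched briefly. This is a toy illustration under stated assumptions: the function names, the four-paragraph corpus, and the example vectors are all hypothetical, and the study's actual features were derived from the SourceFinder corpus and from LSA, COALS, and word fit vector spaces that are not reproduced here.

```python
from math import sqrt


def conditional_probability(paragraphs, given, target):
    """Estimate P(target appears | given appears) over paragraphs.

    paragraphs -- iterable of token lists, one list per paragraph
    """
    with_given = [set(p) for p in paragraphs if given in p]
    if not with_given:
        return 0.0
    return sum(1 for p in with_given if target in p) / len(with_given)


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical mini-corpus: each inner list is one tokenized paragraph.
toy_paragraphs = [
    ["wind", "turbines", "generate", "electricity"],
    ["plants", "generate", "power"],
    ["students", "read", "books"],
    ["solar", "panels", "produce", "electricity"],
]
print(conditional_probability(toy_paragraphs, "generate", "electricity"))

# Hypothetical word vectors; real LSA or COALS vectors have many more dimensions.
print(cosine([0.2, 0.7, 0.1], [0.3, 0.6, 0.0]))
```

A cosine near 1 means the two words have similar distributions across contexts; a cosine near 0 means their distributions are essentially unrelated.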

We ran a stepwise regression to attempt to predict the IRT parameters of each item. We then ran separate regression
analyses for each item type. The results indicated that we could, in fact, predict item parameters, particularly difficulty
and discrimination. In general, the models performed at a moderate level. The analyses showed interesting relationships
among the item types, which will be discussed after we review the individual models.

Predicting Parameters for the Idiomatic Associates Item Type 

A stepwise regression yielded a model in which discrimination for the idiomatic associates item type was predicted by
one factor: the LSA cosine between the target word and the key. This model achieved an R of 0.52, an adjusted R² of 0.25,
and a standard error of 0.22.

The model for the difficulty of the idiomatic associates item type was predicted by two NLP features, each of which was
positively correlated with difficulty: (a) the WordFit cosine between the target word and the best distractor, indicating a
distractor that also had some attraction to the sentential context, and (b) the LSA cosine between the target word and the
best distractor. This model achieved an R of 0.61, an adjusted R² of 0.34, and a standard error of 1.02.
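
The relationship between the reported R and adjusted R² values can be checked with the standard adjustment formula. The sketch below assumes n = 40 observations per regression (20 items with parameters estimated separately for two grades, the N given in the notes to Tables 7-9); the text does not state the regression sample size explicitly, so this is an assumption. Under it, the two idiomatic associates models reproduce the reported adjusted values to two decimals.

```python
def adjusted_r2(r, n, p):
    """Adjusted R-squared from multiple R, n observations, and p predictors."""
    r2 = r * r
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


N_ASSUMED = 40  # assumption: 20 items x 2 grades per item type

# Idiomatic associates, discrimination model: R = 0.52, one predictor.
print(round(adjusted_r2(0.52, N_ASSUMED, 1), 2))

# Idiomatic associates, difficulty model: R = 0.61, two predictors.
print(round(adjusted_r2(0.61, N_ASSUMED, 2), 2))
```

Because the adjustment penalizes each added predictor, it matters most in small samples such as this one.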

Predicting Parameters for the Topical Associates Item Type 

For the topical associates, a stepwise regression yielded a model in which the discrimination was predicted by a number of 
NLP features: (a) the COALS cosine similarity between the key and the best distractor (indicating that at least one of the 
answers was easily confused with the key); (b) the word frequency; and (c) a feature that coded the pattern of association 
cosines between the three stimulus words and the key. 

This last feature was intended to measure how well the topical associate item fit the design template we had specified 
for this item type, in which all three stimulus words should be associated with the key but with none of the options. If all 
three stimulus words were above a threshold association cosine with the key but none of the options, this feature received 
the value of 0. If only two stimulus words were above the threshold value, the feature value was 1. If only one stimulus 
word was above the threshold value, the feature value was 2. If any of the options were above threshold value for any of the 
stimulus words, the feature value was 3. This provided a stricter definition of our intended item design than was available 
when the items were first constructed, and it allowed us to identify which items deviated from the intended design and to 
quantify how serious that deviation was. 
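
The coding scheme just described can be sketched as a small function. This is an illustrative reading of the rules, not the project's topic mapping tool: the function and argument names are hypothetical, the threshold value is not given in the text and must be supplied by the caller, "options" is taken to mean the distractors, and the case in which no stimulus word is associated with the key is not covered by the description, so the sketch leaves it undefined.

```python
def design_fit_feature(stimulus_key_cosines, stimulus_distractor_cosines, threshold):
    """Code how far a topical associate item departs from the design template.

    stimulus_key_cosines        -- association cosine of each of the three
                                   stimulus words with the key
    stimulus_distractor_cosines -- association cosines between stimulus words
                                   and distractors (any order)
    threshold                   -- cutoff for "associated" (value not given
                                   in the text; supplied by the caller)
    """
    # A stimulus word associated with a distractor is the most serious
    # deviation, so it is checked first.
    if any(c > threshold for c in stimulus_distractor_cosines):
        return 3

    above = sum(1 for c in stimulus_key_cosines if c > threshold)
    if above == 3:
        return 0  # all three stimulus words point to the key
    if above == 2:
        return 1
    if above == 1:
        return 2

    # The text does not say how an item with no associated stimulus word
    # was coded, so this sketch returns None rather than guessing.
    return None


# Hypothetical cosines and threshold, for illustration only.
print(design_fit_feature([0.5, 0.6, 0.7], [0.1, 0.05], threshold=0.3))
print(design_fit_feature([0.5, 0.6, 0.1], [0.1, 0.05], threshold=0.3))
```

Higher values mean a poorer match to the template, which is why a negative regression coefficient on this feature corresponds to lower discrimination for poorly fitting items.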

The model performed moderately well (R = 0.62, adjusted R² = 0.33, standard error = 0.24). Features (a) and (b) strongly
increased discrimination, whereas feature (c), which represented the failure to match the item design as specified in the
topic mapping tool, had a negative coefficient.

Table 10 Constants in the Regression Equations Predicting Item Response Theory (IRT) Item Parameters for Each Item Type

Analysis         Idiomatic associate   Topical associate   Hypernym
Discrimination   0.43                  0.64                0.92
Difficulty       -2.50                 -0.63               1.23

A stepwise regression yielded a model in which difficulty for the topical associates item type was predicted by one
feature: how well the item fitted the ideal design specified using the topic tool (failure to do so makes items more difficult).
This model achieved an R of 0.41, an R² of 0.14, and a standard error of 0.62.

Predicting Parameters for the Hypernym Item Type 

For hypernyms, a stepwise regression yielded a model in which two NLP features predicted the discrimination of the
hypernyms, rendering an R of 0.65, an adjusted R² of 0.40, and a standard error of 0.27:

1. COALS cosine between the targeted word and the best distractor.

2. COALS cosine between the key and the best distractor.

In other words, the model predicts that hypernym items will discriminate best to the extent that distractors are
reasonably similar to the word/hypernym pair that defines the item.

A stepwise regression also yielded a model in which the two word frequency measures (the TASA SFI and the log
SourceFinder frequency) combined to predict the difficulty of hypernyms (with an R of 0.58, an adjusted R² of 0.32, and
a standard error of 0.81). As one would expect, more frequent words were easier; less frequent words, more difficult. This
result makes sense if we interpret the coefficients as creating a weighted average, using the SourceFinder frequency to
discount what appear to be slightly inflated estimates of word frequency in the TASA SFI measure.

Interpretation of the Regressions 

A striking feature of the models that resulted is that the intercepts (i.e., constants) are consistent with the ordering of
the three item types by depth in both difficulty and discrimination. On a theoretical basis, we would expect hypernyms
to require the deepest semantic knowledge of targeted vocabulary words and idiomatic associates to require the least. This
should correspond, in turn, to hypernyms tending to have the highest difficulties and discriminations, and idiomatic
associates tending to have the lowest. This hypothesis is consistent with the regression analyses we have obtained, since
the constants in the regression analyses (reported above) fall in the predicted order, as shown in Table 10. These results
provide some confirmatory evidence that the three item types are, in fact, measuring knowledge at different levels of
depth, at least when predictable variance among items of the same type is factored out. In effect, the regressions
account for sources of error. In the case of the topical associates and idiomatic associates item types, the
regressions also indicate that we are making these item types more difficult than intended when the design specifications
are not followed closely.


Discussion 

This study has a number of features that should be taken into account before any inferences are drawn about its larger
implications. First, only a limited number of words could be tested; since these were drawn from a set of academic
vocabulary targeted for instruction in the middle school grades, caution should be exercised when making inferences about
how and whether the results may extend to different types of vocabulary. Second, the population consisted entirely of
seventh and eighth grade students in a convenience sample of U.S. schools. Additional studies will be needed to determine
how and whether these results may vary by population. Finally, a key feature of the study design was that it required each
student to make multiple judgments about the same word. This feature creates the possibility of priming between items,
where (for instance) prior exposure to a word could produce cuing effects facilitating answers on subsequent items testing
the same word. For the purposes of the analyses presented here, we pooled data across orders of presentation and treated
any differences resulting from order as noise (see Note 4).

An important implication of this study is that it strongly supports the feasibility of designing vocabulary items to fit
a cognitive model. Each item type was built from the ground up to measure a different aspect or level of a theoretically
motivated construct (depth of semantic knowledge). We defined and applied a consistent construction principle for each
item type, informed by cognitive theory, and took advantage of corpus resources to control potential sources of variation.
The validity evidence suggests that this construction was successful, yielding differences among the item types consistent
with the theoretical basis on which they were built.

Our results suggest that we did not achieve this goal perfectly. Some items appear to deviate from the intended model,
but in ways that tend to confirm that the other items are functioning as intended. We intend to follow up this study with
additional studies that replicate the results with different words, different populations, and closer control of variables that
account for variations in item functioning. If these results confirm our initial findings, it may be possible to define precise
construction principles for creating vocabulary items designed to measure the depth of vocabulary knowledge.

Given the design on which each item was based, we would also expect that the differences among item types might
prove useful for discriminating among different patterns and levels in the acquisition of semantic knowledge. For instance,
English language learners might acquire a large number of words purely from direct instruction, without the incremental
development that reading large numbers of texts might provide. This might correspond to a shift in the pattern of
performance across item types. Similarly, if some students had extensive reading experience but less facility in inferring
conceptual meanings from text, there might be a shift in the relative difficulty of shallow versus deeper item types along
the continuum we have begun to explore. Such possibilities go well beyond the conclusions that can safely be drawn from
this study alone, but they suggest a line of research that might fruitfully be explored.

Notes

1 A semantic associates item type, intended to measure the fourth level of depth of vocabulary knowledge, did not work well
in pilot testing and was excluded from the study.

2 “When interpreting SFI (Standard Frequency Index) values, note that the SFI statistics form a logarithmic scale, like the Richter 
scale used to evaluate the magnitude of earthquakes. As a result, arithmetic differences in SFI values correspond to geometric 
differences in word frequency” (Zeno et al., 1995, p. 12). 

3 http://www.questarai.com/Products/WordFrequencyGuide/Pages/default.aspx 2-10-11 

4 There do appear to be ordering effects; for two of the item types (topical associates and hypernyms) order was a significant 
predictor, with absolute position in the form accounting for half the variance in form means for the topical associates item type, 
and about one third of the variance in form means for the hypernym item type. These can be interpreted as cuing effects, with 
prior experience with a word facilitating later performance. It will be worthwhile to examine, in future studies, how such effects 
can be controlled or eliminated, perhaps by collecting data from different item types on different test sessions. 